Bproperty is a property solutions provider in Bangladesh for both tenants and owners. It caters to those seeking real estate services by offering a platform that enables anyone to buy, rent, or sell properties in the country. After an owner requests to place a real estate advertisement on Bproperty's website, Bproperty contacts the owner and sends a representative to inspect the property. As a result, no individual can post a listing directly on Bproperty's website, and hence the probability of fake listings is minimal. In this project, about 50 thousand rental listings on Bproperty.com from five cities of Bangladesh were extracted and analyzed.
The main objective here is to build a model that will predict the rent of an apartment with maximum possible accuracy.
This model can be used by the following parties:
import pandas as pd
from matplotlib import pyplot as plt
from google.colab import drive
from bs4 import BeautifulSoup
import requests
For this project, only apartments for rent will be used from all of Bproperty's listings. The same code can be adapted to collect data for other property types (room, duplex, plaza, building, plot, office, shop, etc.) from bproperty.com.
Bproperty has about 45-50 thousand active apartment listings to date, and details of 24 listings can be viewed on one page at a time. So there should be about 2084 pages of apartment listings (50,000 / 24, rounded up). Page numbers are used to differentiate the URLs, meaning page 24 will have the URL: main url (https://www.bproperty.com/en/bangladesh/apartments-for-rent/) + "page-" + page number (24) --> https://www.bproperty.com/en/bangladesh/apartments-for-rent/page-24/. So, a loop has been created to visit the 2000+ webpages and collect the link to the details page of each apartment.
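The page arithmetic and URL pattern described above can be sketched as follows. This is only a minimal illustration, not the scraper itself: `LISTINGS_ESTIMATE`, `page_count`, and the helper `page_url` are hypothetical names, and the 50,000 figure is simply the upper estimate quoted above.

```python
import math

MAIN_URL = "https://www.bproperty.com/en/bangladesh/apartments-for-rent/"
LISTINGS_ESTIMATE = 50_000  # assumption: upper estimate of active listings
LISTINGS_PER_PAGE = 24

# number of result pages needed to show every listing
page_count = math.ceil(LISTINGS_ESTIMATE / LISTINGS_PER_PAGE)


def page_url(page_number: int) -> str:
    """Return the URL of one result page, following the pattern described above."""
    return f"{MAIN_URL}page-{page_number}/"


print(page_count)
print(page_url(24))
```

Rounding up matters here because the last page is usually only partly filled.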
listings_links = []
#the page number will be appended to main_url to create the url of each result page
main_url = "https://www.bproperty.com/en/bangladesh/apartments-for-rent/"
#loop to visit the result pages, each containing links to the details of 24 apartments
#(2084 pages are assumed from the estimate above, so range stops at 2085)
for page_number in range(1, 2085):
    #building the url of the current page; the first page is the main url itself
    url = main_url if page_number == 1 else main_url + "page-" + str(page_number) + "/"
    #getting HTML of a particular webpage in text format
    raw_data = requests.get(url).text
    #parsing the HTML collected from the webpage
    soup = BeautifulSoup(raw_data, 'html5lib')
    #in each webpage there are short profiles and a link to details of 24 different apartment listings.
    #The required links for this project are saved in class named "_287661cb" within tag "a"
    #collecting all the available links within the mentioned class and tag
    for link in soup.find_all("a", class_ = "_287661cb", href = True):
        listings_links.append(link['href'])
#mounting google drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive
#saving the urls in drive
filename = "links.csv"
#converting the list, containing the desired links, into dataframe and saving it
links_csv = pd.DataFrame(listings_links)
links_csv.to_csv(filename)
# Path for my google drive Folder
!cp $filename '/content/gdrive/My Drive/data science projects/'
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
The list should contain almost 50,000 links to the details of apartments. The details are the number of bedrooms, number of baths, size measured in sqft, rent, address (area, neighborhood, city), and features. For simplicity of analysis, only the total number of features will be extracted instead of the details of each feature.
N.B.: the links collected in the last section contain only a fragment of each link. So the main url will be used as the base, and the fragmented links will be attached to it in the loop.
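Joining the base with a fragment can also be done with the standard library's `urljoin` rather than plain string concatenation. The sketch below is only an illustration: the fragment is a hypothetical value shaped like the truncated links in the dataset, not a real listing.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bproperty.com"
# hypothetical fragment, standing in for one scraped href value
url_fragment = "/en/property/details-0000000.html"

full_url = urljoin(BASE_URL, url_fragment)
print(full_url)
```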
#creating a dataframe with the columns for the variables available and required for this project from details page
bproperty_data = pd.DataFrame(columns=["bedrooms",
"baths",
"sqft",
"rent",
"address",
"added_date",
"features",
"link"])
#loading the links to the details pages of all the apartments from drive (in case a new Colab session has started)
#listings_links = pd.read_csv("/content/gdrive/My Drive/data science projects/links.csv")['0']
#print(listings_links[0:5])
#the base url
main_url = "https://www.bproperty.com"
#looping over each remaining link to extract the desired datapoints
#(slicing by the current row count lets an interrupted run resume where it stopped)
for url_fragment in listings_links[bproperty_data.shape[0]:len(listings_links)]:
    #attaching the fragmented link with the main url
    url = main_url + url_fragment
    #fetching and parsing the HTML
    scraped_data = requests.get(url).text
    soup = BeautifulSoup(scraped_data, "html5lib")
    #extracting datapoints as text from identified tag, class, and position from parsed HTML
    bedrooms = soup.find_all("span", class_ = "fc2d1086")[0].text
    baths = soup.find_all("span", class_ = "fc2d1086")[1].text
    sqft = soup.find_all("span", class_ = "fc2d1086")[2].find("span").text
    address = soup.find("div", class_ = "_1f0f1758").text
    added_date = soup.find_all("span", class_ = "_812aa185")[3].text
    features = len(soup.find_all("span", class_ = "_005a682a"))
    rent = soup.find("span", class_ = "_105b8a67").text
    #attaching the datapoints in relevant columns in the main dataframe
    #(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    new_row = pd.DataFrame([{"bedrooms": bedrooms,
                             "baths": baths,
                             "sqft": sqft,
                             "rent": rent,
                             "address": address,
                             "added_date": added_date,
                             "features": features,
                             "link": url}])
    bproperty_data = pd.concat([bproperty_data, new_row], ignore_index=True)
print(bproperty_data.shape)
#saving the scraped data in drive
#mounting google drive
drive.mount('/content/gdrive')
# Path for my google drive Folder
filename = "bproperty_data.csv"
bproperty_data.to_csv(filename)
!cp $filename '/content/gdrive/My Drive/data science projects/'
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
#loading the main dataset from drive (in case a new Colab session has started)
df = pd.read_csv("/content/gdrive/My Drive/data science projects/predicting house rents/bproperty_data.csv")
df = pd.read_csv("bproperty_data.csv")
df.head()
| | Unnamed: 0 | bedrooms | baths | sqft | rent | address | added_date | features | link |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 Beds | 2 Baths | 1,000 sqft | 12,500 | Darjiban Mosque Road, Dargi Para, Sylhet | 14-Nov-21 | 6 | https://www.bproperty.com/en/property/details-... |
| 1 | 1 | 5 Beds | 3 Baths | 1,800 sqft | 20,000 | Mojumdarpara Road, Mazumder Para, Sylhet | 14-Nov-21 | 7 | https://www.bproperty.com/en/property/details-... |
| 2 | 2 | 2 Beds | 2 Baths | 900 sqft | 12,000 | Masjid Lane Society, Sholokbahar, Chattogram | 14-Nov-21 | 8 | https://www.bproperty.com/en/property/details-... |
| 3 | 3 | 2 Beds | 2 Baths | 900 sqft | 13,000 | Masjid Lane Society, Sholokbahar, Chattogram | 14-Nov-21 | 8 | https://www.bproperty.com/en/property/details-... |
| 4 | 4 | 2 Beds | 2 Baths | 1,000 sqft | 14,000 | Uddipon R/A, Mira Bazar, Sylhet | 14-Nov-21 | 6 | https://www.bproperty.com/en/property/details-... |
The address column describes the area, neighborhood, and city name. It can be split to find the city name and neighborhood name, as they are separated by commas. The first part of the address is the area name; the second part, after the first comma, is the neighborhood name; and the third part, after the second comma, is the city name. Besides the work on the address column, three columns will be dropped: "Unnamed: 0", "link", and "added_date".
#Dropping the mentioned columns
df_cleaned = df.drop(labels = ["Unnamed: 0", "link", "added_date"], axis = 1)
df_cleaned.head()
| | bedrooms | baths | sqft | rent | address | features |
|---|---|---|---|---|---|---|
| 0 | 2 Beds | 2 Baths | 1,000 sqft | 12,500 | Darjiban Mosque Road, Dargi Para, Sylhet | 6 |
| 1 | 5 Beds | 3 Baths | 1,800 sqft | 20,000 | Mojumdarpara Road, Mazumder Para, Sylhet | 7 |
| 2 | 2 Beds | 2 Baths | 900 sqft | 12,000 | Masjid Lane Society, Sholokbahar, Chattogram | 8 |
| 3 | 2 Beds | 2 Baths | 900 sqft | 13,000 | Masjid Lane Society, Sholokbahar, Chattogram | 8 |
| 4 | 2 Beds | 2 Baths | 1,000 sqft | 14,000 | Uddipon R/A, Mira Bazar, Sylhet | 6 |
#removing the text suffixes from the bedrooms, baths, and sqft columns by keeping the part before the first space
df_cleaned["bedrooms"] = df_cleaned["bedrooms"].str.split(" ", expand = True)[0]
df_cleaned["baths"] = df_cleaned["baths"].str.split(" ", expand = True)[0]
df_cleaned["sqft"] = df_cleaned["sqft"].str.split(" ", expand = True)[0]
#checking if all the values of bedrooms column after cleaning has any strings in it
df_cleaned['bedrooms'].value_counts().to_frame()
| bedrooms | count |
|---|---|
| 2 | 27138 |
| 3 | 18745 |
| 1 | 2635 |
| 4 | 1289 |
| 5 | 34 |
| 6 | 5 |
| 7 | 1 |
| Studio | 1 |
#removing the only string value "Studio" from the bedrooms column
df_cleaned = df_cleaned[df_cleaned.bedrooms != "Studio"]
#checking if all the values of baths column after cleaning has any strings in it
df_cleaned["baths"].value_counts().to_frame()
| baths | count |
|---|---|
| 2 | 26753 |
| 3 | 11757 |
| 1 | 8761 |
| 4 | 2376 |
| 5 | 199 |
| 6 | 1 |
#removing the comma from numerical values in column "sqft" and "rent"
#converting the columns into float type
df_cleaned["rent"] = df_cleaned["rent"].str.replace(",", "").astype("float")
df_cleaned["sqft"] = df_cleaned["sqft"].str.replace(",", "").astype("float")
df_cleaned["baths"] = df_cleaned["baths"].astype("float")
df_cleaned["bedrooms"] = df_cleaned["bedrooms"].astype("float")
#extracting the city and neighborhood names from the comma separated address column
df_cleaned["city"] = df_cleaned["address"].str.split(",", expand = True)[2]
df_cleaned["neighborhood"] = df_cleaned["address"].str.split(",", expand = True)[1]
#dropping the address column
df_cleaned = df_cleaned.drop(labels = ["address"], axis = 1)
#getting the frequencies of city column
df_cleaned['city'].value_counts().to_frame()
| city | count |
|---|---|
| Dhaka | 37345 |
| Chattogram | 8681 |
| Gazipur | 2674 |
| Sylhet | 338 |
| Cumilla | 298 |
| Badda | 243 |
| Malibagh | 8 |
| Mira Bazar | 2 |
| Subid Bazar | 1 |
| Kakrail | 1 |
| Khilgaon | 1 |
The city column should contain only the names of these cities: Dhaka, Chattogram, Gazipur, Sylhet, and Cumilla. But the values Badda, Malibagh, Mira Bazar, Khilgaon, Kakrail, and Subid Bazar are not city names. To identify what caused these values to appear in the city column, the address column from df will be reviewed by observing all the columns generated by the comma separation.
#taking in all columns to identify aforementioned issue
com_add = df["address"].str.split(",", expand = True)
com_add.head()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Darjiban Mosque Road | Dargi Para | Sylhet | None | None |
| 1 | Mojumdarpara Road | Mazumder Para | Sylhet | None | None |
| 2 | Masjid Lane Society | Sholokbahar | Chattogram | None | None |
| 3 | Masjid Lane Society | Sholokbahar | Chattogram | None | None |
| 4 | Uddipon R/A | Mira Bazar | Sylhet | None | None |
The split produces five columns, which shows that a few rows had more than two commas, causing the neighborhood names to land in the third part, i.e. column 2 of the above table.
#crosschecking the hypothesis that city names went to column 3 by observing Badda's rows, since it had the most faulty occurrences
com_add[com_add.iloc[:, 2] == " Badda"]
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 552 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| 713 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| 4115 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| 4117 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| 4148 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| ... | ... | ... | ... | ... | ... |
| 49687 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| 49699 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| 49825 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| 49829 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
| 49842 | South Baridhara Residential Area | D. I. T. Project | Badda | Dhaka | None |
243 rows × 5 columns
Instead of simply replacing the faulty occurrences with the most frequent city in the whole dataset, the rows were filtered by neighborhood to find the most frequent city for each neighborhood (as multiple cities can have neighborhoods with the same name), and the neighborhood name in the city column was then replaced with that city name.
#replacing neighborhood name with the most frequent city name for that neighborhood in city column
#(the values keep the leading space left over from splitting the address on commas)
city_replace = [" Badda", " Malibagh", " Mira Bazar", " Khilgaon", " Kakrail", " Subid Bazar"]
for i in city_replace:
    replacement = df_cleaned[df_cleaned.neighborhood == i]["city"].value_counts().idxmax()
    print(i, "will be replaced with", replacement)
    #assigning the result back instead of using inplace=True on a selected column
    df_cleaned["city"] = df_cleaned["city"].replace(i, replacement)
Badda will be replaced with Dhaka
Malibagh will be replaced with Dhaka
Mira Bazar will be replaced with Sylhet
Khilgaon will be replaced with Dhaka
Kakrail will be replaced with Dhaka
Subid Bazar will be replaced with Sylhet
#checking if the city column looks alright now
df_cleaned["city"].value_counts().to_frame()
| city | count |
|---|---|
| Dhaka | 37598 |
| Chattogram | 8681 |
| Gazipur | 2674 |
| Sylhet | 341 |
| Cumilla | 298 |
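A different way to handle addresses with a varying number of commas is to count the parts from the end rather than from the start. The sketch below is not what this notebook does; it assumes the city is always the last comma-separated part and the neighborhood the second to last, which would need to be checked against the full dataset.

```python
import pandas as pd

# two addresses taken from the tables above: one with three parts, one with four
addresses = pd.Series([
    "Uddipon R/A, Mira Bazar, Sylhet",
    "South Baridhara Residential Area, D. I. T. Project, Badda, Dhaka",
])

# splitting into lists (no expand) so the parts can be indexed from the end
parts = addresses.str.split(",")
city = parts.str[-1].str.strip()
neighborhood = parts.str[-2].str.strip()

print(pd.DataFrame({"neighborhood": neighborhood, "city": city}))
```

Stripping the parts also removes the leading space that a plain comma split leaves in front of each value.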
df_cleaned.head()
| | bedrooms | baths | sqft | rent | features | city | neighborhood |
|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 1000.0 | 12500.0 | 6 | Sylhet | Dargi Para |
| 1 | 5.0 | 3.0 | 1800.0 | 20000.0 | 7 | Sylhet | Mazumder Para |
| 2 | 2.0 | 2.0 | 900.0 | 12000.0 | 8 | Chattogram | Sholokbahar |
| 3 | 2.0 | 2.0 | 900.0 | 13000.0 | 8 | Chattogram | Sholokbahar |
| 4 | 2.0 | 2.0 | 1000.0 | 14000.0 | 6 | Sylhet | Mira Bazar |
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Index: 49847 entries, 0 to 49847
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bedrooms      49847 non-null  float64
 1   baths         49847 non-null  float64
 2   sqft          49847 non-null  float64
 3   rent          49847 non-null  float64
 4   features      49847 non-null  int64
 5   city          49592 non-null  object
 6   neighborhood  49841 non-null  object
dtypes: float64(4), int64(1), object(2)
memory usage: 3.0+ MB
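As it is mandatory to enter all the information before posting a house for rent on Bproperty, the numeric columns (bedrooms, baths, sqft, rent, and features) fortunately have no missing values. The city and neighborhood columns, however, have 49,592 and 49,841 non-null entries out of 49,847, so 255 city values and 6 neighborhood values are missing, most likely because those addresses had fewer than three comma-separated parts.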
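Missing values can also be counted per column directly instead of being read off the `info()` output. The sketch below uses two hypothetical rows (the second address is made up and has only two comma-separated parts) to show how a short address leaves the city empty after the split.

```python
import pandas as pd

# hypothetical rows: the second address has only two comma-separated parts
sample = pd.DataFrame({
    "rent": [12500.0, 9000.0],
    "address": ["Uddipon R/A, Mira Bazar, Sylhet", "Mirpur, Dhaka"],
})
# same positional split as used for the city column above
sample["city"] = sample["address"].str.split(",", expand=True)[2]

# number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```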
df_cleaned.describe()
| | bedrooms | baths | sqft | rent | features |
|---|---|---|---|---|---|
| count | 49847.000000 | 49847.000000 | 49847.000000 | 4.984700e+04 | 49847.000000 |
| mean | 2.377455 | 2.167493 | 974.394848 | 1.701064e+04 | 11.761510 |
| std | 0.630975 | 0.781340 | 415.955743 | 2.562665e+04 | 6.876985 |
| min | 1.000000 | 1.000000 | 120.000000 | 2.500000e+03 | 0.000000 |
| 25% | 2.000000 | 2.000000 | 700.000000 | 1.050000e+04 | 7.000000 |
| 50% | 2.000000 | 2.000000 | 850.000000 | 1.400000e+04 | 8.000000 |
| 75% | 3.000000 | 3.000000 | 1200.000000 | 1.800000e+04 | 19.000000 |
| max | 7.000000 | 6.000000 | 7000.000000 | 4.530000e+06 | 46.000000 |
The description shows that the mean number of bedrooms and baths on bproperty.com is between 2 and 3. The mean area of the listings is 974 sqft, with a maximum of 7,000 sqft. Moreover, the rent column's mean (about 17,011) is higher than its median (14,000), indicating that the majority of the listings have a relatively low asking rent while outliers in the right tail pull the mean rent upward.
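The effect of a right tail on the mean can be illustrated with a handful of made-up rents. The values below are hypothetical and only mimic the shape of the problem: one extreme listing is enough to pull the mean far above the median.

```python
import pandas as pd

# hypothetical rents in Taka; the last value mimics an extreme listing
rents = pd.Series([9000, 10500, 14000, 18000, 250000], dtype="float")

mean_rent = rents.mean()
median_rent = rents.median()
# a positive skewness value indicates a longer right tail
print(mean_rent, median_rent, rents.skew())
```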
df_cleaned["neighborhood"].value_counts().to_frame()
| neighborhood | count |
|---|---|
| Mirpur | 8570 |
| Mohammadpur | 4817 |
| Gazipur Sadar Upazila | 2661 |
| Uttara | 2387 |
| Jatra Bari | 1666 |
| ... | ... |
| Banglamotors | 1 |
| South Khulsi | 1 |
| Aditya Para | 1 |
| Taltala | 1 |
| Goran | 1 |
188 rows × 1 columns
From the frequencies, it is evident that Mirpur, Mohammadpur, Gazipur Sadar Upazila, Uttara, and Jatra Bari are the top 5 neighborhoods in terms of number of rent listings. The total number of neighborhoods listed on Bproperty across all 5 cities combined is 188.
#importing necessary libraries for visualizations
import plotly.express as px
import seaborn as sns
fig = px.scatter(
data_frame=df_cleaned[(df_cleaned['sqft'] <= 5000)], #to remove extreme outliers for a better view at the relation
x="sqft",
y="rent",
size="bedrooms",
color="baths",
hover_name="city",
size_max=20,
opacity = 0.6
)
fig.show()
The scatterplot above shows how the number of bathrooms, the number of bedrooms, and the area in sqft relate to the rent of the house or flat. The number of bedrooms is depicted as the size of each bubble. The average bubble size visibly increases from left to right in the plot, indicating a strong relationship between the size of the apartment and the number of bedrooms. The relationship between rent and the number of bedrooms, however, cannot be discerned by eye.
The same holds for the number of bathrooms (shown as color), indicating a weak rent-baths relationship.
However, a slight linear relationship between the area of the apartments in sqft and their rent can be seen in the graph.
#printing correlations
print(df_cleaned[['rent', 'sqft']].corr())
print(df_cleaned[['rent', 'baths']].corr())
print(df_cleaned[['rent', 'bedrooms']].corr())
print(df_cleaned[['rent', 'features']].corr())
rent sqft
rent 1.000000 0.457137
sqft 0.457137 1.000000
rent baths
rent 1.000000 0.316865
baths 0.316865 1.000000
rent bedrooms
rent 1.000000 0.285499
bedrooms 0.285499 1.000000
rent features
rent 1.000000 0.212289
features 0.212289 1.000000
The correlations agree with the reading of the visualization: only the correlation between area in sqft and rent (0.46) is considerable.
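The four pairwise correlations can also be obtained in one call by taking the rent column of the full correlation matrix. The rows below are hypothetical and serve only to show the call pattern; `rent_corr` is an illustrative name.

```python
import pandas as pd

# hypothetical listings, only to show the call pattern
sample = pd.DataFrame({
    "rent": [12000.0, 14000.0, 20000.0, 35000.0],
    "sqft": [800.0, 900.0, 1300.0, 2000.0],
    "baths": [2.0, 2.0, 3.0, 3.0],
    "bedrooms": [2.0, 2.0, 3.0, 4.0],
    "features": [6, 8, 7, 19],
})

# correlation of every numeric column with rent, strongest first
rent_corr = sample.corr()["rent"].drop("rent").sort_values(ascending=False)
print(rent_corr)
```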
#plotting the rent in boxplots for each city from the city column
fig = px.box(df_cleaned[df_cleaned["rent"] < 100000], x="city", y="rent")
fig.show()
According to the boxplots, Dhaka has the highest median rent, 15k, among the 5 cities. This was predictable, as it is the most populous of these cities and land is scarce. After Dhaka come Chattogram, Sylhet, and Cumilla, with median asking rents hovering between 11k and 12k. Gazipur has the lowest median asking rent at 9k. So it is clear that the typical rent varies significantly from city to city. Hence, this factor could also be significant when predicting rent. Apart from the median rent, the boxplots also show a considerable number of positive outliers in the rent distributions of Dhaka and Chattogram.
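The medians read off the boxplots can also be computed directly with a group-by. The sketch below uses a few hypothetical rows rather than the scraped data, so the numbers are illustrative only.

```python
import pandas as pd

# hypothetical listings; rents are in Taka
sample = pd.DataFrame({
    "city": ["Dhaka", "Dhaka", "Dhaka", "Gazipur", "Gazipur", "Sylhet"],
    "rent": [13000.0, 15000.0, 40000.0, 8000.0, 10000.0, 12000.0],
})

# median rent per city, highest first
median_rent = sample.groupby("city")["rent"].median().sort_values(ascending=False)
print(median_rent)
```

The median is used rather than the mean because it is not pulled upward by a few very expensive listings.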
To predict rents, the random forest machine learning technique will be used. Since the city column contains strings and scikit-learn's random forest implementation requires numeric input, this variable will be replaced with dummy variables before being passed to the random forest algorithm.
#importing preprocessing module from sklearn library to encode categorical variables
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
#creating a new dataframe named "feature" by adding bedrooms, baths, sqft, features, and dummy variables from the city column
catg_vars = ["city", "neighborhood", "rent"]
num_vars = df_cleaned.drop(labels = catg_vars, axis = 1)
num_vars.head()
feature = pd.concat([num_vars, pd.get_dummies(df_cleaned["city"])], axis = 1)
feature.head()
| | bedrooms | baths | sqft | features | Chattogram | Cumilla | Dhaka | Gazipur | Sylhet |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 1000.0 | 6 | False | False | False | False | True |
| 1 | 5.0 | 3.0 | 1800.0 | 7 | False | False | False | False | True |
| 2 | 2.0 | 2.0 | 900.0 | 8 | True | False | False | False | False |
| 3 | 2.0 | 2.0 | 900.0 | 8 | True | False | False | False | False |
| 4 | 2.0 | 2.0 | 1000.0 | 6 | False | False | False | False | True |
#separating the dependent and independent variables.
#dependent variable
y = df_cleaned['rent'].values
print(y)
#independent variables
x = feature.values
x
[12500. 20000. 12000. ... 9000. 25000. 13000.]
array([[2.0, 2.0, 1000.0, ..., False, False, True],
[5.0, 3.0, 1800.0, ..., False, False, True],
[2.0, 2.0, 900.0, ..., False, False, False],
...,
[1.0, 1.0, 500.0, ..., True, False, False],
[3.0, 3.0, 1580.0, ..., False, False, False],
[3.0, 2.0, 900.0, ..., True, False, False]], dtype=object)
Using the train_test_split function, the dataset will be split into two parts: a training set and a testing set. The random forest model will be fitted on the training set. Afterwards, the model will be evaluated on the testing set, which is completely new to the model. random_state is used to fix the split, meaning that only one random split will be generated and used throughout while optimizing the model. After model finalization this will no longer be used, so that the model can be evaluated on truly random splits. The dataset is split into training and test data in a 7:3 ratio.
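The effect of fixing random_state can be checked on a small example. The arrays below are hypothetical (10 rows, 2 columns) and the `_demo` names are illustrative; the point is only that two calls with the same seed return the same 7:3 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 10 rows with 2 features each
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# two independent calls with the same seed
split_a = train_test_split(x_demo, y_demo, test_size=0.3, random_state=11)
split_b = train_test_split(x_demo, y_demo, test_size=0.3, random_state=11)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(X_train_demo.shape, X_test_demo.shape)

# True when every returned array matches between the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```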
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=11)
#importing random forest module
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
#fitting the training datasets in the model
model.fit(X_train, y_train)
RandomForestRegressor()
The model has been built from the training dataset using the Random Forest technique. Now, this model will be applied to the test dataset to predict the rent. X_test includes the independent variables: sqft, city, number of bathrooms, number of features, and number of bedrooms. These variables will be passed to the model to predict the rent for the testing dataset.
y_pred = model.predict(X_test)
Now that rent has been predicted, this prediction will be used to evaluate the model's accuracy. For evaluation, the R-squared statistic will be applied. R-squared measures what proportion of the dependent variable's variation is explained by the model using the given independent variables.
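As a worked illustration of that definition, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The numbers below are made up and the variable names are illustrative; they are not taken from the model.

```python
# made-up actual and predicted rents, in thousands of Taka
y_true = [10.0, 20.0, 30.0]
y_hat = [12.0, 18.0, 33.0]

mean_true = sum(y_true) / len(y_true)
# variation left unexplained by the predictions
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
# total variation of the actual values around their mean
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

Here the residual sum of squares is 17 and the total sum of squares is 200, giving an R-squared of 0.915.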
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
0.7272832643918732
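Here, the R-squared score is 0.7273, which means that about 72.73% of the variation in the dependent variable is explained by the model that was developed. Because RandomForestRegressor is not given a random_state here, the score can change somewhat from run to run.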
width = 10
height = 8
plt.figure(figsize=(width, height))
#sns.distplot is deprecated in recent seaborn versions, so kdeplot is used to draw the same density curves
ax1 = sns.kdeplot(y_pred, color="r")
ax2 = sns.kdeplot(y_test, color="b", ax=ax1)
plt.title('Distribution plot of predicted values by the model and actual values from test dataset')
plt.xlabel('Rent (in Taka)')
plt.ylabel('Proportion of listings')
plt.show()
plt.close()
The graph shows that the distribution of the predicted rents follows the distribution of the actual rents closely over most of the range, except around the high peak in the lower segment of the rent distribution.